Name | Version | Summary | Date |
rftokenizer | 2.3.0 | A character-wise tokenizer for morphologically rich languages | 2024-12-17 19:05:30 |
alphacodings | 0.2.0 | base26 ([A-Z]) and base52 ([A-Za-z]) encodings | 2024-12-09 03:04:43 |
QuickBPE | 2.1 | A fast BPE implementation in C | 2024-12-05 11:37:29 |
zhon | 2.1.1 | Zhon provides constants used in Chinese text processing. | 2024-11-20 00:29:10 |
llama-tokens | 0.0.3 | A Quick Library with Llama 3.1/3.2 Tokenization - source https://github.com/jeffxtang/llama-tokens | 2024-11-10 17:03:39 |
eKoNLPy | 2.0.6 | A Korean natural language processing toolkit for economic analysis | 2024-11-02 00:02:54 |
huspacy | 0.12.0 | HuSpaCy: industrial strength Hungarian natural language processing | 2024-10-28 10:30:55 |
miditok | 3.0.4 | MIDI / symbolic music tokenizers for Deep Learning models. | 2024-09-15 10:43:00 |
maze-dataset | 1.1.0 | generating and working with datasets of mazes | 2024-09-10 19:33:49 |
taibun | 1.1.7 | Taiwanese Hokkien Transliterator and Tokeniser | 2024-08-31 20:25:01 |
bpeasy | 0.1.3 | Fast bare-bones BPE for modern tokenizer training | 2024-08-23 10:47:52 |
simplemma | 1.1.1 | A lightweight toolkit for multilingual lemmatization and language detection. | 2024-08-08 12:20:45 |
textmate-grammar-python | 0.6.1 | A lexer and tokenizer for grammar files as defined by TextMate and used in VSCode, implemented in Python. | 2024-07-31 18:43:24 |
process-twarc | 0.20.2 | Tools for transforming raw data from Twarc2 to structured data for Masked Language Modeling. | 2024-06-12 11:40:55 |
example990420 | 1.1.1 | Taiwanese Hokkien Transliterator and Tokeniser | 2024-05-01 20:28:38 |